This course will give a basic introduction on (untargeted) preprocessing of LC-MS (also extendable to GC-MS or LC-MS/MS) based metabolomics. The methods and workflow presented in this course our mainly based on the workflow presented by the xcms package on Bioconductor.
For this course, you will need R version > 4.4.
If you haven’t already, the packages needed to complete today’s course can be installed by running the next code block:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("xcms")
BiocManager::install("MsExperiment")
BiocManager::install("Spectra")
And load the required packages:
First, let’s do a quick review of what kind of data we are dealing with. LC-MS data is generated by first separating chemical molecules based on certain properties (e.g. polarity). Afterwards the molecules are ionized and in the MS instrument, they are separated based on their mass-to-charge ratio or m/z and their intensity (abundance) is measured. Thus, molecules are separated in two different dimensions, the retention time dimension (from the LC) and the mass-to-charge dimension (from the MS) making it easier to measure and identify molecules in more complex samples.
The workflow presented by xcms (but also applicable in for example
MZmine3 and most untargeted algorithms) is:
In this course we are using a subset of the data presented in Lloyd-Price et al. (2019). The subset is available to download via Metaboanalyst (https://new.metaboanalyst.ca/MetaboAnalyst/upload/SpectraUpload.xhtml) and contains 10 spectra (UPLC-Q/E-ESI-, C18) organized into three groups (Healthy, Crohn’s Disease and QC). The data is already available in the open-source .mzML format.
# Read the file names from the data folder
files <- list.files("data/",full.names = TRUE,pattern = ".mzML")
# Import the available metdata with the groups
metadata <- read.table("data/trimmed_metadata.txt",header = TRUE)
# Re-order metadata to match the files sequence
metadata <- metadata[order(metadata$Sample),]
#' Import the data of the experiment
ms_data <- readMsExperiment(files, sampleData = metadata)
Our object contains 4461 MS1 spectra in 10 samples
ms_data
## Object of class MsExperiment
## Spectra: MS1 (4461)
## Experiment data: 10 sample(s)
## Sample data links:
## - spectra: 10 sample(s) to 4461 element(s).
Various functions can be used to access specific data from the object:
sampleData(ms_data)
## DataFrame with 10 rows and 3 columns
## Sample Disease spectraOrigin
## <character> <character> <character>
## 1 CD-6KUCT Control C:\\Users\\p...
## 2 CD-77FXR Control C:\\Users\\p...
## 3 CD-9OS5Y Control C:\\Users\\p...
## 4 CD-9WOBP Control C:\\Users\\p...
## 5 HC-9SNJ4 Disease C:\\Users\\p...
## 6 HC-9X47O Disease C:\\Users\\p...
## 7 HC-AMR37 Disease C:\\Users\\p...
## 8 HC-AUP8B Disease C:\\Users\\p...
## 9 QC_PREFA02 QC C:\\Users\\p...
## 10 QC_PREFB02 QC C:\\Users\\p...
spectra(ms_data)
## MSn data (Spectra) with 4461 spectra in a MsBackendMzR backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 480.036 1
## 2 1 480.303 2
## 3 1 480.571 3
## 4 1 480.838 4
## 5 1 481.106 5
## ... ... ... ...
## 4457 1 598.670 442
## 4458 1 598.997 443
## 4459 1 599.265 444
## 4460 1 599.532 445
## 4461 1 599.800 446
## ... 33 more variables/columns.
##
## file(s):
## CD-6KUCT.mzML
## CD-77FXR.mzML
## CD-9OS5Y.mzML
## ... 7 more files
MS instruments allow to export data in profile or centroid mode. Profile data contains the signal for all discrete m/z values (and retention times) for which the instrument collected data (see first figure below). MS instruments continuously sample and record signals, therefore a mass peak for a single ion in one spectrum will consist of multiple intensities at discrete m/z values. The process to reduce this distribution of signals to a single representative mass peak (the centroid, see second figure below) is called centroiding.
In this case, we need to inspect if the data we are analyzing is profile data (and thus still needs to be centroided for use with xcms) or not. Usually, centroiding is performed when converting the data from the proprietary data fromat to .mzML, but it can also be performed in R (see https://jorainer.github.io/xcmsTutorials/articles/xcms-preprocessing.html#centroiding-of-profile-ms-data).
We can easily check if the data is centroided using the function
isCentroided. Checking in one file should be
sufficient.
isCentroided(spectra(ms_data[1]))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [106] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [121] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [136] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [151] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [166] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [181] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [196] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [211] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [226] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [241] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [256] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [286] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [301] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [316] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [331] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [346] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [361] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [376] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [391] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [406] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [421] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [436] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
The function returns TRUE for all spectra in the sample.
Visually, we can check some of the centroided data for a known compound in the data (see suppplementary info of the Lloyd-Price paper).
| Method | mz | RT (min) | Metabolite | |
|---|---|---|---|---|
| C18n | 538.3212 | 8.66 | Met - Cholate |
#Here we extract the m/z and RT range for our compound in the two QC samples using filterRT & filterMzRange, afterwards we plot.
ms_data[c(9,10)] |>
filterSpectra(filterMzRange,538.3212 + c(-0.005, 0.005)) |>
filterSpectra(filterRt,519.6 + c(-10,10)) |>
plot()
We can inspect the total ion chromatogram using
chromatogram. Setting aggregationFun = “sum” allows to
calculate the total ion chromatogram (TIC), aggregationFun = “max” the
base peak chromatogram (BPC).
#' Extract and plot a BPC
tic <- chromatogram(ms_data, aggregationFun = "sum")
plot(tic,main = "Total Ion Chromatogram")
We can also create boxplots representing the distribution of the total
ion currents per data file. Such plots can be very useful to spot
potentially problematic MS runs.
library(RColorBrewer)
## Generate a color set for the three groups
group_colors <- paste0(brewer.pal(3, "Set1")[1:3], "60")
names(group_colors) <- c("Control", "Disease","QC")
## Get the total ion current by file
tc <- spectra(ms_data) |>
tic() |>
split(f = fromFile(ms_data))
boxplot(tc, col = group_colors[sampleData(ms_data)$Disease],
ylab = "intensity", main = "Total ion current")
In this section, we will go through the 3 main steps of LC-MS preprocessing: peak picking, alignment and correspondence (and gap filling). Remember that you should tailor the parameters and settings to your specific dataset in you are refering to this document in the future.
Chromatographic peak detection aims to identify peaks along the retention time axis that represent the signal from individual compounds’ ions. This involves identifying and quantifying such signals as shown in the sketch below.
Within xcms, several peak detection algorithms can be used and
accessed by their respective parameter objects:
MatchedFilterParam, CentWaveParam and
MassifquantParam. However, in this workshop we will focus
on the centWave algoritm (see the original publication (Tautenhahn,
Böttcher, and Neumann 2008) for more details) as it is most frequently
used in LC-MS metabolomics applications.
The different parameters available for the centWave algorithm can be
called with CentWaveParam
CentWaveParam()
## Object of class: CentWaveParam
## Parameters:
## - ppm: [1] 25
## - peakwidth: [1] 20 50
## - snthresh: [1] 10
## - prefilter: [1] 3 100
## - mzCenterFun: [1] "wMean"
## - integrate: [1] 1
## - mzdiff: [1] -0.001
## - fitgauss: [1] FALSE
## - noise: [1] 0
## - verboseColumns: [1] FALSE
## - roiList: list()
## - firstBaselineCheck: [1] TRUE
## - roiScales: numeric(0)
## - extendLengthMSW: [1] FALSE
## - verboseBetaColumns: [1] FALSE
It is strongly discouraged to use these default parameter settings for any of your preprocessing! There are tools available that automatically configure these parameters based on your data (such as IPO and autoTuner), but generally, the best results are obtained by tuning the parameters manually to your data and ions of interest.